========================================================

##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
## 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5

This report explores a red wine data set with 11 variables describing the chemical properties of each wine and one variable holding an expert quality rating between 0 (very bad) and 10 (excellent). Using this data, we will try to find which chemical properties influence the quality of red wines.

Univariate Plots Section

## [1] 1599   12
## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

This data set contains 12 variables and 1599 observations. All the variables are numeric.

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

The output above gives summary statistics for each variable.

## 
##       Bad      good very_good 
##        63      1319       217

Quality ratings range from 3 to 8, with most ratings at the middle values (5 or 6). I created a new column named quality.level that divides the ratings into 5 groups, of which only 3 occur in this data set.
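As a minimal sketch of how such a grouping could be built in base R, the snippet below bins the first six quality values shown at the top of the report with `cut()`. The bin edges follow the five groups described in this report; the labels `very_bad` and `excellent` are hypothetical, since only `Bad`, `good`, and `very_good` appear in the output, and the original code may have used a different approach.

```r
# First six quality ratings, copied from the head() output at the top of the report.
rw <- data.frame(quality = c(5L, 5L, 5L, 6L, 5L, 5L))

# Assumed bins: [0,2], (2,4], (4,6], (6,8], (8,10].
# "very_bad" and "excellent" are hypothetical labels (no wines fall in those bins).
quality_labels <- c("very_bad", "Bad", "good", "very_good", "excellent")
quality_breaks <- c(0, 2, 4, 6, 8, 10)

rw$quality.level <- as.character(
  cut(rw$quality, breaks = quality_breaks,
      labels = quality_labels, include.lowest = TRUE)
)

# Count the wines in each group.
table(rw$quality.level)
```

Storing the result as character matches the `Class :character` shown for quality.level in the later summary output; a factor would work equally well for plotting.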

Fixed acidity is right skewed, with a peak around 7.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

In the second plot I limited the values to 1 to take a closer look at the data. Most of the data is between 0.3 and 0.6.

##     fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 152           9.2             0.52           1            3.4      0.61
##     free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates
## 152                  32                   69  0.9996 2.74         2
##     alcohol quality quality.level
## 152     9.4       4           Bad

Many observations have a citric acid value of 0.0, and there is another peak at 0.49. In the second plot I removed the observation with a value of 1.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

## 
##  0.9  1.2  1.3  1.4  1.5  1.6 1.65  1.7 1.75  1.8  1.9    2 2.05  2.1 2.15 
##    2    8    5   35   30   58    2   76    2  129  117  156    2  128    2 
##  2.2 2.25  2.3 2.35  2.4  2.5 2.55  2.6 2.65  2.7  2.8 2.85  2.9 2.95    3 
##  131    1  109    1   86   84    1   79    1   39   49    1   24    1   25 
##  3.1  3.2  3.3  3.4 3.45  3.5  3.6 3.65  3.7 3.75  3.8  3.9    4  4.1  4.2 
##    7   15   11   15    1    2    8    1    4    1    8    6   11    6    5 
## 4.25  4.3  4.4  4.5  4.6 4.65  4.7  4.8    5  5.1 5.15  5.2  5.4  5.5  5.6 
##    1    8    4    4    6    2    1    3    1    5    1    3    1    8    6 
##  5.7  5.8  5.9    6  6.1  6.2  6.3  6.4 6.55  6.6  6.7    7  7.2  7.3  7.5 
##    1    4    3    4    4    3    2    3    2    2    2    1    1    1    1 
##  7.8  7.9  8.1  8.3  8.6  8.8  8.9    9 10.7   11 12.9 13.4 13.8 13.9 15.4 
##    2    3    2    3    1    2    1    1    1    2    1    1    2    1    2 
## 15.5 
##    1
## 
##        dry    off_dry      sweet very_sweet 
##          2       1357        232          8

When we take the log for the plot, we can see that many values have few or no observations. The peak of the data is around 2. Above 4, the count per value is less than 10. To get a better view of the sugar level I created a new column, “sugar.level”. Based on the Wine Folly website (https://winefolly.com/review/sugar-in-wine-chart/), I divided the sugar level into:

  • dry –> below 1.2
  • off-dry –> 1.2 - 3
  • sweet –> 3 - 12
  • very sweet –> above 12

Based on the new sugar levels, most of the wine is off-dry, followed by sweet.
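The sugar levels can be sketched with `cut()` as well. The example below is an illustration, not the report's original code. It uses left-closed bins, which is consistent with the counts printed above (for example, the 25 wines with a value of exactly 3 count as sweet). Treating a value of exactly 12 as very sweet is an assumption, since no wine in the data has that value.

```r
# Residual sugar values that all appear in the frequency table above.
sugar <- c(0.9, 1.2, 2.95, 3, 12.9)

# right = FALSE gives the bins [-Inf,1.2), [1.2,3), [3,12), [12,Inf).
sugar.level <- as.character(
  cut(sugar, breaks = c(-Inf, 1.2, 3, 12, Inf),
      labels = c("dry", "off_dry", "sweet", "very_sweet"), right = FALSE)
)

sugar.level
```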

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

The second plot shows the log10 of the data, which shows the outliers better. Most of the data is between 0.06 and 0.112.
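To illustrate why a log10 scale helps with a right-skewed variable, the sketch below applies `log10()` to the five summary values of chlorides printed above. On the raw scale the maximum sits much further from the median than the minimum does; on the log scale the two distances are nearly equal. The variable names are chosen here for illustration only.

```r
# Min, 1st quartile, median, 3rd quartile, and max of chlorides from the summary above.
chlorides <- c(0.012, 0.070, 0.079, 0.090, 0.611)
log_chlorides <- log10(chlorides)

# Distance from the median to the minimum and to the maximum, on each scale.
raw_gap_low  <- chlorides[3] - chlorides[1]
raw_gap_high <- chlorides[5] - chlorides[3]
log_gap_low  <- log_chlorides[3] - log_chlorides[1]
log_gap_high <- log_chlorides[5] - log_chlorides[3]

round(c(raw_low = raw_gap_low, raw_high = raw_gap_high,
        log_low = log_gap_low, log_high = log_gap_high), 3)
```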

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

Both Free Sulfur Dioxide and Total Sulfur Dioxide are right skewed. Limiting the Total Sulfur Dioxide data to 175 gives a better picture of the data.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0037

Density is approximately normally distributed around 0.9965.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

## # A tibble: 3 x 2
##   `rw$pH.level`     n
##   <chr>         <int>
## 1 high             48
## 2 low             726
## 3 moderate        825

pH values are approximately normally distributed. To better understand the effect of pH I created a new column, “pH.level”, based on the Wine Spectator website (https://www.winespectator.com/drvinny/show/id/5035):

  • 3.3 to 3.6 –> best (labeled “moderate” in the data)
  • lower than 3.3 –> low
  • higher than 3.6 –> high

Most of the pH values are moderate, followed by low.
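A minimal sketch of this classification, using nested `ifelse()` calls on the first six pH values from the top of the report. Whether the boundary values 3.3 and 3.6 themselves count as moderate is an assumption here; the original code is not shown.

```r
# First six pH values, copied from the head() output at the top of the report.
rw <- data.frame(pH = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51))

# Assumption: 3.3 and 3.6 themselves are treated as "moderate".
rw$pH.level <- ifelse(rw$pH < 3.3, "low",
                      ifelse(rw$pH > 3.6, "high", "moderate"))

# Count the wines in each pH level.
table(rw$pH.level)
```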

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

The graph is skewed to the right. The majority of data is between 0.5 and 0.7.

The shape of the data is right skewed with the peak around 7.

Univariate Analysis

What is the structure of your dataset?

There are 1599 red wine records with 11 columns for chemical characteristics (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol) and one column representing the expert rating (quality). All the variables are numeric. I then created 3 more columns (quality.level, sugar.level, and pH.level).

Other observations:

  • Most of the quality ratings are at the middle values (5 or 6), with a minimum of 3 and a maximum of 8
  • Many observations have a citric acid value of 0.0
  • Most sugar values are around 2
  • The density and pH values are approximately normally distributed
  • Fixed acidity, sulphates, free sulfur dioxide, and total sulfur dioxide are right skewed

What is/are the main feature(s) of interest in your dataset?

The main feature in the data set is the quality. I would like to see how other features affect it.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think sugar, citric acid, and alcohol will affect the quality.

Did you create any new variables from existing variables in the dataset?

Yes, I created 3 new columns:

quality.level: based on the quality column, I divided the quality ratings into levels (quality –> quality.level):

  • 1 - 2 –> very bad (no values)
  • 3 - 4 –> Bad
  • 5 - 6 –> Good
  • 7 - 8 –> very good
  • 9 - 10 –> excellent (no values)

sugar.level: based on the Wine Folly website (https://winefolly.com/review/sugar-in-wine-chart/), I divided the sugar values into:

  • dry –> below 1.2
  • off-dry –> 1.2 - 3
  • sweet –> 3 - 12
  • very sweet –> above 12

pH.level: based on the Wine Spectator website (https://www.winespectator.com/drvinny/show/id/5035), I divided the pH values into:

  • 3.3 to 3.6 –> best
  • lower than 3.3 –> low
  • higher than 3.6 –> high

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I log-transformed the residual sugar to get a better view of the plot; it shows that many values have few or no observations. I also log-transformed the chlorides plot, which better shows the outliers and the gap after the first value, 0.012.

Bivariate Plots Section

Scatter plot of fixed acidity per quality group. The red line represents the mean. The blue lines represent the 0.05 and 0.95 quantiles.

Box plot of fixed acidity per quality level

Scatter plot of volatile acidity per quality level

Box plot of citric acid per quality level

In both graphs the blue lines represent the 0.05 and 0.95 quantiles. The red line represents the center of each group (the mean; strictly, the 0.5 quantile is the median). The lower the volatile acidity, the higher the quality. For fixed acidity and citric acid, the box plots suggest that higher values go with better quality.

The blue lines show the 0.95 and 0.05 quantiles. The red line shows the center of the data, as in the previous plots.
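The summary lines drawn on these plots can be computed per quality rating in base R. The sketch below uses the first ten fixed acidity and quality values from the `str()` output earlier in the report; `line_stats` is a hypothetical helper name. It reports the mean and the 0.5 quantile (the median) separately, because the two need not coincide.

```r
# First ten values of each column, copied from the str() output earlier in the report.
fixed.acidity <- c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
quality <- c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)

# Hypothetical helper: the statistics behind the red and blue lines.
line_stats <- function(x) {
  c(mean   = mean(x),
    q05    = unname(quantile(x, 0.05)),
    median = unname(quantile(x, 0.50)),
    q95    = unname(quantile(x, 0.95)))
}

# One column per quality rating, one row per statistic.
stats_by_quality <- sapply(split(fixed.acidity, quality), line_stats)
round(stats_by_quality, 2)
```

In this small sample the mean and median of the rating-5 group already differ slightly, which is why it helps to state which of the two a plotted line shows.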

Most of the very good wine is off-dry, which means the residual sugar is between 1.2 and 3. Generally speaking, most wines are off-dry or sweet regardless of quality.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100


The mean chlorides value of very good quality wine is lower than that of the other groups.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

Box plot of free sulfur dioxide by quality level

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

Box plot of the log of free sulfur dioxide by quality level

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

Box plot of total sulfur dioxide by quality level

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

Box plot of the log of total sulfur dioxide by quality level

For both total and free sulfur dioxide, a lower value is better.

> For density as well, the very good quality group has the lowest mean.

The bad quality group has the highest mean pH and the very good quality group has the lowest; the very good quality mean is around 3.25. Most of the pH values are low or moderate.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

> For sulphates, the very good quality group has the highest mean.

Clearly, the highest quality group has the highest mean alcohol.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

I observed the relationship between the quality group and all other variables.

fixed acidity, volatile acidity, and citric acid: For both fixed acidity and citric acid, the mean increases with better quality. For volatile acidity it is the opposite: the mean decreases with better quality.

residual sugar: Most of the wines are off-dry, followed by sweet. There is no specific characteristic that distinguishes good wine from bad wine.

chlorides: The mean of very good quality wine is lower than that of the other groups.

total sulfur dioxide and free sulfur dioxide: For both total and free sulfur dioxide, a lower value is better.

density and pH: For both, the very good quality group has the lowest mean.

sulphates: The better the quality, the higher the mean.

alcohol: The better the quality, the higher the mean.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

No.

What was the strongest relationship you found?

The increase in alcohol, sulphates, and citric acid with quality group, and the decrease in volatile acidity with quality group.
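Group comparisons like these can be reproduced with `aggregate()`. The sketch below uses only the first ten rows from the `str()` output earlier in the report, so it shows the mechanics, not the full result; the quality.level assignment is a simplified stand-in for the one used in the report.

```r
# First ten rows of three columns, copied from the str() output earlier in the report.
rw <- data.frame(
  quality = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5)
)

# Simplified grouping for the ratings that occur in the data (3 to 8).
rw$quality.level <- ifelse(rw$quality <= 4, "Bad",
                           ifelse(rw$quality <= 6, "good", "very_good"))

# Mean alcohol and mean volatile acidity per quality level.
group_means <- aggregate(cbind(alcohol, volatile.acidity) ~ quality.level,
                         data = rw, FUN = mean)
group_means
```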

Multivariate Plots Section


The higher the alcohol level the lower the density.
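One way to put a number on this relationship is the Pearson correlation. The report itself does not compute it; this is an added illustration using only the six rows shown at the top of the report, so the value on the full data will differ.

```r
# First six alcohol and density values, copied from the head() output at the top of the report.
alcohol <- c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4)
density <- c(0.9978, 0.9968, 0.9970, 0.9980, 0.9978, 0.9978)

# A negative coefficient means density tends to fall as alcohol rises.
r <- cor(alcohol, density)
r
```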

The higher the sugar value, the lower the alcohol value. Very good wine has an off-dry to sweet sugar level and a higher alcohol value.

Very sweet wine appears in the good quality group, and its alcohol level is low. The density decreases as the sweetness level decreases.

## [1] "The mean of volatile acidity for low pH level:"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.180   0.370   0.480   0.494   0.600   1.240
## [1] "The mean of volatile acidity for moderate pH level:"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.4200  0.5600  0.5531  0.6600  1.5800
## [1] "The mean of volatile acidity for high pH level:"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2200  0.4675  0.5900  0.6054  0.6625  1.1850

I removed the one observation with a citric acid value equal to 1. The number of observations with a high pH level is very small (48), especially among the very good wines. In the very good quality group, the lower the pH level, the lower the mean volatile acidity and the higher the citric acid value.
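The filtering and per-level summary described here could look roughly like the following. The data frame holds the six rows from the top of the report plus row 152 (the wine with citric acid equal to 1); the pH.level rule repeats the thresholds given earlier, and `rw_clean` is a name chosen here for illustration.

```r
# Six rows from the head() output plus row 152, the wine with citric.acid equal to 1.
rw <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.52),
  citric.acid      = c(0.00, 0.00, 0.04, 0.56, 0.00, 0.00, 1.00),
  pH               = c(3.51, 3.20, 3.26, 3.16, 3.51, 3.51, 2.74)
)
rw$pH.level <- ifelse(rw$pH < 3.3, "low",
                      ifelse(rw$pH > 3.6, "high", "moderate"))

# Drop the single observation with citric acid equal to 1.
rw_clean <- subset(rw, citric.acid < 1)

# Mean volatile acidity within each pH level.
va_by_ph <- tapply(rw_clean$volatile.acidity, rw_clean$pH.level, mean)
va_by_ph
```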

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality      quality.level      sugar.level          pH.level        
##  Min.   :3.000   Length:1599        Length:1599        Length:1599       
##  1st Qu.:5.000   Class :character   Class :character   Class :character  
##  Median :6.000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :5.636                                                           
##  3rd Qu.:6.000                                                           
##  Max.   :8.000

The values for good and very good quality are similar; however, the very good quality group has a higher sulphates mean.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

pH, volatile acidity, and citric acid:

  • To have very good wine we need to balance the citric acid value and the pH value. The lower the pH, the higher the citric acid. This relationship is clear in the very good wine only.
  • For very good wine, low pH values go with lower volatile acidity.
  • The lower the pH level, the lower the mean volatile acidity.

Free sulfur dioxide, sulphates, and quality group:

  • The very good quality group has the highest sulphates mean.
  • There is no relationship between free sulfur dioxide and sulphates.

Density, alcohol, and sugar level:

  • The density decreases as the sweetness level decreases.
  • The alcohol level is low for the very sweet wine.
  • A better wine means a higher alcohol level and an off-dry to sweet level of sugar.

Were there any interesting or surprising interactions between features?

To have very good wine we need to balance the citric acid value and the pH value.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.


Final Plots and Summary

Plot One

Description One

The chloride content is responsible for the saltiness of the wine. The plot shows that very good wine has a lower mean.

Plot Two

Description Two

  • Better quality wine has a higher amount of alcohol
  • Most of the wines fall under the off-dry sugar level
  • The sweeter the wine, the lower the alcohol value
  • Very good wine has an off-dry to sweet level of sugar and the highest mean alcohol value

Plot Three

Description Three

  • Volatile acidity gives a vinegar taste to the wine; for better wine this value decreases
  • Citric acid gives some freshness and flavor to the wine; for better wine this value increases
  • pH: the closer the value is to 0, the more acidic the wine. Very good wine has the lowest mean

To have very good wine we need to balance the citric acid value and the pH value. The lower the pH, the higher the citric acid. This relationship is clear in the very good wine only.


Reflection

This data contains 1599 red wine records with 11 columns for chemical characteristics and 1 column for the expert rating. I created 3 more columns:

  • quality level: a coarser classification of the quality rating
  • sugar level: based on the Wine Folly website, I divided the sugar values into 4 levels
  • pH level: based on the Wine Spectator website, I divided the pH values into 3 levels

Before the analysis, here is a short explanation of the chemical characteristics:

  • Fixed acidity: most acids involved with wine are fixed or nonvolatile
  • Volatile acidity: too much gives the taste of vinegar
  • Citric acid: gives the wine a taste of freshness
  • Residual sugar: sugar value (I divided it into levels in the sugar level column)
  • Chlorides: salt
  • Free sulfur dioxide: prevents microbial growth and the oxidation of the wine
  • Total sulfur dioxide: free + bound sulfur dioxide
  • Density: the density of the wine, which depends on the alcohol and sugar level
  • pH: closer to 0 is more acidic and closer to 14 is more basic
  • Sulphates: antimicrobial and antioxidant
  • Alcohol: alcohol level

I plotted the variables to better understand their distributions. For quality, most of the wines fall under good quality (a rating of 5 or 6). For the sugar level, most of the values fall under the off-dry level. The density and pH values are approximately normally distributed. Alcohol, sulphates, total sulfur dioxide, free sulfur dioxide, and fixed acidity are right skewed.

After that I examined each variable against the quality variable to find out whether there is a relationship between them. I found that volatile acidity, chlorides, total and free sulfur dioxide, density, and pH decrease with better quality, while fixed acidity, citric acid, sulphates, and alcohol increase with better quality.

To identify whether a combination of variables can affect the quality, I used multivariate analysis and found that for very good wine there is a balance between the citric acid and pH values: the lower the pH, the higher the citric acid. This relationship is clear in the very good wine only. There is another relationship between the sugar level and the alcohol value: the higher the sugar level, the lower the alcohol value. A better wine means a higher alcohol level and an off-dry to sweet level of sugar.